Abstract



Most cognitive architectures rely on discrete representations, both in space
(e.g., objects) and in time (e.g., events). However, a robot's interaction with
the world is inherently continuous, both in space and in time. The
segmentation of the stream of perceptual inputs a robot receives into discrete
and meaningful events poses a challenge in bridging the gap between internal
cognitive representations and the external world. Event Segmentation
Theory, recently proposed in the context of cognitive systems research, holds
that humans segment time into events based on matching perceptual
input with predictions. In this work we propose a framework for online event
segmentation, targeting robots endowed with active perception. Moreover,
sensory processing systems have an intrinsic latency, resulting from factors
such as sampling rate and computational processing, which
is seldom accounted for. This framework is founded on the theory of
dynamical systems synchronization, where the system considered includes both
the robot and the world, coupled (strong anticipation). An adaptation rule is
used to perform simultaneous system identification and synchronization, and
anticipating synchronization is employed to predict the short-term system
evolution. This prediction allows for appropriate control of the robot
actuation. Event boundaries are detected once synchronization is lost (a
sudden increase of the prediction error). An experimental proof of concept of
the proposed framework is presented, together with some preliminary results
corroborating the approach.



Keywords: Event segmentation, anticipative systems, active perception, 
cognitive robotics. 



Contents 

1 Introduction

2 Related work

3 Strong anticipation

4 Adaptive synchronization

5 Event segmentation

6 Experimental results

7 Conclusions and future work






1 Introduction 



The perception of a robot is grounded in the physical world. Its sensors
receive a continuous stream of information, for instance the light patterns
hitting the CCD sensor of a video camera. Cognitive representations,
however, are often discrete, as in the case of events and objects. Although
the use of digital computers demands that all sensory information be
discretized, this discretization is commonly performed with a fixed, not
always adjustable, discretization step (e.g., the frame rate and the pixel
resolution of a video camera). The detection of meaningful events from a
stream of sensory information is an important challenge from the point of
view of the design of a cognitive architecture for robots, contributing to
bridging the gap between a continuous-time world and discrete-time,
event-based cognitive representations.

The segmentation of a continuous stream of information into events is
often overlooked, being commonly performed in an ad hoc manner, for instance
by resorting to threshold values over heuristic functions or to fixed time
triggers. These methods, however, are mostly sensor modality dependent, as
well as task specific. This work addresses the problem of bridging the gap
between the continuous-time stream of sensory/actuation information and
the discrete-time sequence of cognitive representations, proposing a modality
and task independent framework for event segmentation.

This problem is addressed using a biologically inspired approach. Under 
this paradigm, our goal is not to faithfully model any aspect of the human 
brain, but rather to employ findings from neuroscience capable of providing 
guidance on how to engineer better systems. 

The Event Segmentation Theory (EST) provides a model of how the human
brain segments perception into a sequence of events [THJ, [7]. This model
holds that event segmentation is based on the detection of prediction
errors in the sensory stream. Prediction is a commonplace mechanism found in
many brain systems. In particular, the human brain is permanently making
predictions and comparing them with the actual outcome [12]. Events are
detected whenever a significant disparity between prediction and outcome is
encountered. An event segmentation mechanism can be built following this
principle, but the problem of how to make predictions about perceptions has
to be addressed first.

Dubois distinguishes between strong and weak anticipation [3], [13]: the
latter is based on an explicit model of the world, where the physics is encoded
in analytical constructs that can be mathematically solved given an initial
condition. In contrast, strong anticipation does not rely on a model,
but rather on the dynamical evolution of the interaction of the agent with
the world, seen as a single system. An example of strong anticipation can be
found in the behavior of an outfield baseball player when catching a
well-struck ball¹: weak anticipation of the ball landing position requires
modeling the physics of the ball, encoding the initial state of the system
(initial velocity, mass, friction coefficient, etc.), and then predicting the
landing position by solving the analytical model; in contrast, strong
anticipation views the outfielder and the ball as a single system with new
dynamics, as the outfielder moves driven by the projection of the ball on the
retina. Empirical evidence suggests that this is the way a human outfielder
functions [13]. In the context of robotics, a model-based approach to
anticipation may be appropriate for passive sensors, but when designing
systems that actively engage in interactions with the world, as in the case of
active perception, the world can no longer be modeled as an independent,
self-contained system.

Stepp proposes an approach to strong anticipation based on the work
developed in the field of chaotic systems concerning synchronization of
dynamical systems [13]. Consider two systems, denoted D (drive) and R
(response), connected by a unidirectional flow of information from D to R. It
is possible to design the system R such that its dynamic evolution
synchronizes with that of D, regardless of the initial condition of each
system. One way of doing this is for the R system to compare its state with
that of D, and bias its dynamics accordingly, i.e., system R is controlled by
a feedback loop, where the error results from this comparison. More
interestingly, if this feedback loop contains a delay, system R is capable,
under certain conditions, of anticipating system D [TI]. Considering that
system D includes both the robot and the world, and system R to be a model
internal to the robot, this approach suggests an interesting mechanism to
perform strong anticipation of the dynamical evolution of the world-robot
system.

One problem remains to be solved: how to design system R? No system
model is assumed a priori, since it depends on the coupling involving the
robot and the world. A possible approach is to adapt system R during
interaction. A solution to the adaptation of response systems in the context
of dynamical systems synchronization has been proposed by Chen [1], where
the convergence to the solution has been proved using Lyapunov stability
theory. This result does not directly apply, however, to anticipating
synchronization.

The contributions of this work are: 

• An event segmentation method based on Stepp's strong anticipation
concept [13], cast as an anticipating system synchronization framework;

• The application of Chen's parameter identification method [1] to
anticipating synchronization;

• A proof-of-concept implementation of an architecture for event
segmentation and active perception, employing these methods.

¹ Example from [13].

This report is organized as follows: after a short section surveying related 
work, two sections on the theoretical background behind strong anticipation 
and the adaptation method to learn the response system R follow. Then, the 
proposed architecture for event segmentation is described, followed by some 
experimental results of a proof of concept implementation of these ideas. A 
section presenting some conclusions and open questions closes the report. 



2 Related work 

The problem of event segmentation has been studied in the past. See [10] for
a review of recent techniques for the formation of event memories in robots.
Ramoni et al. proposed a method to cluster robot activities using Markov
chain models. In pE] a batch maximum likelihood estimator is used to
fit a sequence of time-indexed models to raw data. The incremental
version of this algorithm is based on thresholding the likelihood of the
current model over time. The spatio-temporal segmentation of video has been
researched in [T5], applying motion model clustering, and in [2] using
hierarchical clustering of the 3D space-time video stream. Gesture
segmentation and recognition have been addressed in [6] employing hidden
Markov models (HMM).



3 Strong anticipation 



In [13] strong anticipation is modeled using a dynamical system
synchronization framework. Consider two continuous dynamical state vectors
$x(t), y(t) \in \mathbb{R}^n$ with the following coupled dynamics:

$$\begin{aligned}
\dot{x} &= f(x) \\
\dot{y} &= f(y) + k\,(x - y_\tau)
\end{aligned} \tag{1}$$

where $y_\tau = y(t - \tau)$, i.e., a feedback loop with a constant delay
$\tau$, and $k$ is a scalar gain. The first system is called the drive (D)
and the second the response (R). This delayed feedback loop in the response
system is a fundamental aspect, and is responsible for the capability of the
response system to anticipate the trajectory of the drive.
This delayed feedback loop is neurophysiologically supported by the
discovery of forward models in the brain, which predict sensory consequences
of motor commands [HI El |S]. These models receive as input a copy of the
subject's motor action, and produce a prediction of future perceptions. For
instance, when performing an arm movement, these models predict the
trajectory followed by the arm, as perceived by the subject. One important
function of this mechanism is to overcome the sensory processing latency in
the brain when the subject is performing controlled, quick movements.

To understand how the response system can anticipate the drive, consider
first that $\tau = 0$ and that the systems are synchronized at time $t_0$,
i.e., $x(t_0) = y(t_0)$. Under these conditions, the systems will remain
synchronized, since $x - y_\tau = 0$ and thus there is null feedback in the
response. In this case, the concatenated state $z = (x, y) \in
\mathbb{R}^{2n}$ evolves in the $x = y$ hyperplane, called the
synchronization manifold [9]. The response system synchronizes with
the drive if the error system with state $e = y - x$, also called the
transversal system, obtained by subtracting the two equations in (1) for
$\tau = 0$,

$$\dot{e} = f(y) - f(x) - k\,e \tag{2}$$

is able to reject the perturbation $e$, driving it to zero. For
$f(y) \approx f(x)$, system (2) behaves like a first-order system with an
exponential decay to zero. Anticipation is realized once $\tau > 0$, as
synchronization implies $x(t) = y_\tau = y(t - \tau)$ and thus
$y(t) = x(t + \tau)$, meaning that the response anticipates the drive. This
is called anticipating synchronization [14], where $x = y_\tau$
defines the anticipatory manifold [15].

Successful synchronization from an arbitrary initial condition is not
guaranteed in general (except for simple cases), and strongly depends on the
values of $k$ and $\tau$. However, for any delay value $\tau$, $e(t) = 0$ is
a fixed point of the transversal system, meaning that once synchronized, the
system will remain so. Voss conjectures that, if $e(t) = 0$ is a stable fixed
point for $\tau = 0$, then there is a $\tau_0 > 0$ such that, for any
$0 < \tau < \tau_0$, the transversal system has a stable fixed point at
$e(t) = 0$. This conjecture has been backed up by numerical simulations [15].

In general, for sufficiently small $\tau$, stability of the transversal system
can be expected. In the case of this work, since $\tau$ models the delay of
the perceptual system (e.g., the latency from a change in the environment up
to its detection by the computer vision algorithm), this delay can be assumed
smaller than the time scale of the events being perceived.
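The anticipating behavior described in this section can be illustrated with a small numerical sketch. The example below is not taken from the experiments of this report: it assumes a harmonic oscillator as the drive ($f(x) = Ax$), a forward Euler discretization, and illustrative values for `tau`, `k`, and `dt`. The function name `simulate_anticipating_sync` is hypothetical.

```python
import numpy as np

def simulate_anticipating_sync(tau=0.2, k=1.0, dt=0.01, t_end=40.0):
    """Euler simulation of x' = f(x), y' = f(y) + k (x - y_tau).

    Assumption for illustration: f(x) = A x is a harmonic oscillator.
    Returns the drive and response trajectories and the delay in samples.
    """
    A = np.array([[0.0, 1.0], [-1.0, 0.0]])
    d = int(round(tau / dt))            # feedback delay, in samples
    n_steps = int(round(t_end / dt))
    x = np.zeros((n_steps + 1, 2))
    y = np.zeros((n_steps + 1, 2))
    x[0] = [1.0, 0.0]                   # drive initial condition
    y[0] = [-0.5, 0.3]                  # response starts unsynchronized
    for n in range(n_steps):
        # Delayed response state; before t = tau the history is held at y[0].
        y_delayed = y[n - d] if n >= d else y[0]
        x[n + 1] = x[n] + dt * (A @ x[n])
        y[n + 1] = y[n] + dt * (A @ y[n] + k * (x[n] - y_delayed))
    return x, y, d
```

After the transient, `y[n]` should match `x[n + d]`, i.e., the response runs ahead of the drive by the feedback delay, while `y[n]` and `x[n]` at the same instant remain different.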




4 Adaptive synchronization



In the previous section it was assumed that the dynamics of the drive and
response systems are equal. If the drive system corresponds to the
world-robot coupled system, its dynamics is not known a priori. One way of
tackling this problem is to adapt the response system, online, during
synchronization.

Chen proposes in [1] an approach to adapt response systems in the context
of dynamical system synchronization. It does not account, however, for a
delayed feedback.

Consider that the drive system has the form

$$\dot{x} = f(x) + F(x)\theta \tag{3}$$

where $\theta \in \mathbb{R}^m$ is a vector of (constant) parameters,
$f(x) \in \mathbb{R}^n$ and $F(x) \in \mathbb{R}^{n \times m}$. The response
system is identical, except for the parameter vector, which is unknown, and
for the synchronization feedback loop

$$\dot{y} = f(y) + F(y)\alpha + U(y, x, t, \alpha) \tag{4}$$

where $\alpha$ is the response parameter vector, and $U(y, x, t, \alpha)$ is
called the controller of the response. Chen et al. proved in [1] that, under
certain conditions, not only does the response system synchronize with the
drive, but also the response parameters $\alpha$ converge to those of the
drive, $\theta$, i.e.,

$$\lim_{t \to +\infty} \|\alpha(t) - \theta\| = 0. \tag{5}$$

These conditions consist of the existence of a smooth controller
$U(y, x, t, \theta)$ and of a scalar (Lyapunov) function $V(e)$, where
$e = y - x$, such that:

1. $c_1 \|e\|^2 \le V(e) \le c_2 \|e\|^2$,

2. the derivative of $V(e)$ along the solution of the coupled system

$$\begin{aligned}
\dot{x} &= f(x) + F(x)\theta \\
\dot{y} &= f(y) + F(y)\theta + U(y, x, t, \theta)
\end{aligned} \tag{6}$$

satisfies $\dot{V}(e) \le -W(e)$, and

3. the parameter vector $\alpha$ is adapted according to the learning rule

$$\dot{\alpha}(t) = -F^T(x)\,[\nabla V(e)]^T \tag{7}$$

for $\nabla V(e)$ denoting the gradient (row) vector of $V$ with respect to
$e$,

where $c_1$ and $c_2$ are two positive constants, $W(e)$ is a positive
definite function², and $U(y, y, t, \theta) = 0$.

This result has two important consequences: first, it proves convergence,
provided that the response system is capable of synchronizing with the drive
if $\alpha = \theta$ (i.e., if the true parameters were known), and second,
it provides a learning law, in the form of the time derivative of $\alpha$.
However, in order to use this result, one has to find a controller $U$ and a
function $V$ satisfying the premises of the theorem. Chen shows that the
controller

$$U(y, x, t, \theta) = -e + f(x) - f(y) + [F(x) - F(y)]\,\theta \tag{8}$$

and the Lyapunov function

$$V(e) = \frac{1}{2}\, e^T e \tag{9}$$

satisfy the premises for any $F$ and $f$.
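As a quick check of these premises (a sketch that simply substitutes the controller above, written with the true parameters $\theta$, into the response equation of the coupled system and differentiates the quadratic Lyapunov function along the solution):

```latex
\begin{align*}
\dot{y} &= f(y) + F(y)\theta + U(y, x, t, \theta)
         = f(x) + F(x)\theta - e = \dot{x} - e, \\
\dot{e} &= \dot{y} - \dot{x} = -e, \\
\dot{V}(e) &= e^{T}\dot{e} = -e^{T}e = -\|e\|^{2}.
\end{align*}
```

Hence the first premise holds with $c_1 = c_2 = 1/2$, the second holds with $W(e) = \|e\|^2$, and $U(y, y, t, \theta) = 0$ because $e = 0$ and the differences in $f$ and $F$ vanish when $x = y$.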

The practical application of these results raises three issues.
One is the assumption that functions $F$ and $f$ are known, meaning that
one should have prior knowledge of the structure of the dynamics of the
system. One can reverse this argument, stating that, given functions $f$ and
$F$ sufficiently generic, this method allows the adaptation to any dynamical
system that can be modeled by (3) for some parameter vector $\theta$. Second,
this result was proved for continuous-time systems. The discretization of the
learning rule (7) raises the issue of the choice of a learning rate (hidden
in a proportionality constant of $V$, since the theorem is invariant to a
change of scale of this Lyapunov function). Finally, the third issue concerns
hidden state variables: if there is a state variable that is hidden, i.e.,
the Lyapunov function $V(e)$ does not depend on its error, then this function
is no longer positive definite. This requires that all drive state variables
be fed to the response system controller. This is mostly true³ once the state
variables considered are all obtained from perception (as in the case of the
outfield baseball player example above).
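A minimal numerical sketch of this adaptation scheme is given below. It is not the implementation used in this report: it assumes a Van der Pol oscillator as the drive (so that a single unknown parameter multiplies a known nonlinearity), a forward Euler discretization, and an explicit learning-rate factor `gain`, which stands for the proportionality constant of the Lyapunov function mentioned above. The names `adapt`, `MU_TRUE`, and `gain` are hypothetical.

```python
import numpy as np

MU_TRUE = 1.0  # drive parameter theta, unknown to the response (assumed value)

def f(x):
    # Parameter-free part of the dynamics: harmonic oscillator.
    return np.array([x[1], -x[0]])

def F(x):
    # n x m regressor (n = 2, m = 1): Van der Pol damping term.
    return np.array([[0.0], [(1.0 - x[0] ** 2) * x[1]]])

def adapt(t_end=100.0, dt=0.002, gain=1.0):
    """Euler simulation of the drive, the adaptive response, and the
    learning rule with V(e) = e^T e / 2, so that grad V(e)^T = e."""
    theta = np.array([MU_TRUE])
    x = np.array([1.0, 0.0])     # drive state
    y = np.array([-1.0, 0.5])    # response state, starts unsynchronized
    alpha = np.array([0.0])      # initial parameter estimate
    for _ in range(int(round(t_end / dt))):
        e = y - x
        # Controller evaluated with the current estimate alpha.
        U = -e + f(x) - f(y) + (F(x) - F(y)) @ alpha
        dx = f(x) + F(x) @ theta
        dy = f(y) + F(y) @ alpha + U
        dalpha = -gain * (F(x).T @ e)
        x, y, alpha = x + dt * dx, y + dt * dy, alpha + dt * dalpha
    return alpha, y - x
```

In this sketch the estimate `alpha` should approach `MU_TRUE` while the synchronization error goes to zero; convergence of the parameters relies on the drive trajectory being sufficiently rich, which the limit cycle of the assumed oscillator provides.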

5 Event segmentation 

The event segmentation framework we propose in this work, depicted in
Figure 1, consists of a pair of response systems, one performing adaptation
(labeled adaptive response), and the other anticipation (labeled anticipating
response). The adaptive response learns the parameter vector $\alpha$ as
described in the Adaptive synchronization section, while the anticipating
response performs anticipating synchronization as explained in the Strong
anticipation section. The robot-world coupled system is modeled by the
controlled drive system. Note that the access of the architecture to the
world state is subject to a delay, modeling for instance the latency of the
perceptual channel (image acquisition, processing, and tracking). The
controller computes the actuation vector $u$ based on the anticipated world
state $y$.

Figure 1: System architecture, consisting of the drive system and the
perceptual delay (world), and the double response system formed by the
adaptive and the anticipating responses. The anticipating response uses the
parameters $\alpha$ obtained by the adaptive response. The control input $u$
is obtained by a controller fed with the anticipated state $y$.

² $W(0) = 0$ and $W(e) > 0$ for any $e \neq 0$.

³ Occlusion of objects by others has to be accounted for.

The drive system, together with the perceptual delay, is modeled by the
dynamical system

$$\dot{x} = f(x) + F(x)\theta + u \tag{10}$$

where $u$ is the control input, modeling the actuation of the robot in the
world. Shifting this equation by a delay of $\tau$ one obtains

$$\dot{x}_\tau = f(x_\tau) + F(x_\tau)\theta + u_\tau \tag{11}$$

where $u_\tau(t) = u(t - \tau)$. This model can be put in the form of (3) by
defining a time-varying function

$$f_\tau(x_\tau, t) = f(x_\tau) + u_\tau \tag{12}$$

from which $\dot{x}_\tau = f_\tau(x_\tau, t) + F(x_\tau)\theta$. The adaptive
response receives the delayed state $x_\tau$, together with the delayed
control input $u_\tau$:

$$\dot{y}^* = f(y^*) + F(y^*)\alpha + u_\tau + U(y^*, x_\tau, t, \alpha) \tag{13}$$

Since $f_\tau(y^*, t) = f(y^*) + u_\tau$, this equation can be put in the
form of (4). The anticipating response is described by

$$\dot{y} = f(y) + F(y)\alpha + u + k\,(x_\tau - y_\tau) \tag{14}$$

where $y_\tau = y(t - \tau)$ as before, and the parameter vector $\alpha$
equals the one obtained by the adaptive response. The anticipatory
synchronization manifold is defined by $y_\tau = x_\tau$. Thus, $y = x$,
meaning that the anticipating response is synchronized with the drive system,
which is the same as saying that it is anticipating the delayed perception
$x_\tau$. By shifting (12) in time one gets
$f_\tau(y, t + \tau) = f(y) + u$, allowing us to write (11) and (14) as

$$\begin{aligned}
\dot{x}_\tau &= f_\tau(x_\tau, t) + F(x_\tau)\theta \\
\dot{y} &= f_\tau(y, t + \tau) + F(y)\alpha + k\,(x_\tau - y_\tau)
\end{aligned} \tag{15}$$

thus matching (1) (except for the time-varying dynamics, which do not affect
the previous considerations on anticipating synchronization) when
$\alpha = \theta$.

According to Event Segmentation Theory [TH], perceptual systems
continuously make predictions about perceptual input, and perceive
event boundaries when transient errors in prediction arise. In the adaptive
synchronization framework, the Lyapunov function $V(e)$ defined in (9), for
$e = y^* - x_\tau$, provides an estimate of the prediction error. Considering
the function values in a time window, we can associate the obtained samples
with a random variable with Normal distribution of mean $\mu_V$ and variance
$\sigma_V^2$. Under this assumption, the normalized metric

$$b_V = \frac{V - \mu_V}{\sigma_V} \tag{16}$$

is normally distributed with zero mean and unit variance. When $|b_V|$
exceeds a threshold $b_{event}$, an event boundary is detected. If $b_V$ is
normally distributed with zero mean and unit variance, the cumulative
probability of the distribution tails for $|b_V| > b_{event}$ is the
probability of false positive detection. Thus, $b_{event}$ should be
sufficiently high so that false positive detection is minimized, but low
enough to detect the increase in prediction error due to a sudden change in
the dynamics of the system.
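The detection rule above can be sketched as a sliding-window test on the sequence of prediction error samples. The sketch below makes two assumptions that are not specified in the text: the window statistics are computed over the samples strictly preceding the current one, and a small constant `eps` guards against a zero standard deviation. The function name `detect_event_boundaries` is hypothetical.

```python
import numpy as np

def detect_event_boundaries(V, window, b_event=3.0, eps=1e-12):
    """Return the sample indices where the normalized prediction error
    exceeds the threshold b_event in absolute value.

    V      : sequence of prediction error samples (Lyapunov function values)
    window : number of past samples used to estimate mean and deviation
    """
    V = np.asarray(V, dtype=float)
    events = []
    for n in range(window, len(V)):
        past = V[n - window:n]              # samples preceding sample n
        mu, sigma = past.mean(), past.std()
        b = (V[n] - mu) / (sigma + eps)     # normalized metric
        if abs(b) > b_event:
            events.append(n)
    return events
```

For example, feeding the function a slowly oscillating error signal with a single isolated jump should return only the index of that jump, provided the oscillation stays within a few standard deviations of the window mean.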
Figure 2: Simulated scenario, where $\beta_1 = \beta_3 = \pi/12$.



6 Experimental results 

As a proof of concept for the ideas presented here, a simple scenario was
simulated: a ball rolling freely on a series of inclined planes, with
different slopes, is observed by a robot camera which aims to follow it, in
order to center it on the image, as depicted in Figure 2. The camera moves
parallel to the plane, for simplicity's sake.

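Denoting the ball coordinates by $v = [v_1\ v_2]^T$ and the camera
coordinates by $c = [c_1\ c_2]^T$, the ball projection $x = [x_1\ x_2]^T$ in
the image plane is assumed orthographic: $x = v - c$. Assuming that there is
no ground friction, the dynamics of the ball is a double integrator

$$\begin{aligned}
\ddot{v}_1 &= -g \sin\beta \cos\beta \\
\ddot{v}_2 &= -g \sin^2\beta
\end{aligned} \tag{17}$$

Considering that the camera support is frictionless and that its movement is
controlled in acceleration (i.e., force control), the resulting drive system,
in state space form, is given by

$$\frac{d}{dt}
\begin{bmatrix} \dot{x}_1 \\ \dot{x}_2 \\ x_1 \\ x_2 \end{bmatrix}
=
\begin{bmatrix}
-g \sin\beta \cos\beta - \ddot{c}_1 \\
-g \sin^2\beta - \ddot{c}_2 \\
\dot{x}_1 \\
\dot{x}_2
\end{bmatrix} \tag{18}$$

considering the state vector $x = [\dot{x}_1\ \dot{x}_2\ x_1\ x_2]^T$. This
system can be put in the form (10) once

$$f(x) = \begin{bmatrix} 0 \\ 0 \\ \dot{x}_1 \\ \dot{x}_2 \end{bmatrix},
\quad
\theta = \begin{bmatrix} -g \sin\beta \cos\beta \\ -g \sin^2\beta \end{bmatrix},
\quad
F(x) = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 0 \\ 0 & 0 \end{bmatrix},
\quad
u = \begin{bmatrix} -\ddot{c}_1 \\ -\ddot{c}_2 \\ 0 \\ 0 \end{bmatrix}
\tag{19}$$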



For this proof of concept, we set the response system to be structurally
identical, thus employing the same functions $f$ and $F$, and control input
$u$. The vector $\alpha = [\alpha_1\ \alpha_2]^T$ is the parameter vector to
be adapted according to Chen's learning rule (7).

When the anticipating response is synchronized with the drive, we have
$y = x$, and thus the dynamics of the anticipating response becomes (with
$y$ denoting here the position components of the anticipated state)

$$\ddot{y} = \alpha - \ddot{c}. \tag{20}$$

The camera motion controller considered has the form

$$\ddot{c} = \alpha + k_p\, y + k_d\, \dot{y} \tag{21}$$

where $k_p$ and $k_d$ are the proportional and the derivative gains of the
controller. Thus, the closed loop dynamics becomes

$$\ddot{y} = -k_p\, y - k_d\, \dot{y} \tag{22}$$

The design of the controller gains $k_p$ and $k_d$ can be performed by pole
placement (in the experiments we set $k_d^2 = 4 k_p$, yielding a smooth
response with a double pole at $-k_d/2$).
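The pole location follows directly from the characteristic polynomial of the closed loop, assuming it is the second-order system $\ddot{y} + k_d \dot{y} + k_p y = 0$:

```latex
\begin{align*}
s^{2} + k_d\, s + k_p = 0
  \quad &\Longrightarrow \quad
  s = \frac{-k_d \pm \sqrt{k_d^{2} - 4 k_p}}{2}, \\
k_d^{2} = 4 k_p
  \quad &\Longrightarrow \quad
  s_{1} = s_{2} = -\frac{k_d}{2}.
\end{align*}
```

This is the critically damped case: the discriminant vanishes, so the response has no oscillatory component. For instance, gains $k_p = 1$ and $k_d = 2$ satisfy the condition and place both poles at $s = -1$.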

The experiments were conducted after discretizing the above equations
using a simple approximation $\dot{z}(t) \approx [z(t + T) - z(t)]/T$. The
sampling rate was 100 Hz, $k_p = 1$, $k_d = 2$, $k = 1$, and the Lyapunov
function used was (9). The delay considered was $\tau = 0.65$ s (65 samples).
Event boundaries are detected using a 10-second window and $b_{event} = 3$.
The system is initialized with the ball starting on the top left position of
the ramp, and as the ball traverses the scenario there are two events,
corresponding to the two changes of the ramp slope. Each simulation takes
100 s of simulated time.
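As an illustration of this discretization, the sketch below applies the forward difference approximation to a second-order closed loop of the form $\ddot{y} = -k_p y - k_d \dot{y}$, using the gains and the sampling rate reported above. The initial condition, the simulated duration, and the function name `simulate_closed_loop` are assumptions made for the example; it does not reproduce the full simulated scenario.

```python
def simulate_closed_loop(kp=1.0, kd=2.0, rate_hz=100.0, t_end=20.0,
                         y0=1.0, v0=0.0):
    """Forward Euler simulation of y'' = -kp*y - kd*y'.

    Uses z(t + T) = z(t) + T * z'(t) with T = 1 / rate_hz.
    Returns the list of positions, one per sample (including the initial one).
    """
    T = 1.0 / rate_hz
    y, v = y0, v0                  # position and velocity
    trajectory = [y]
    for _ in range(int(round(t_end * rate_hz))):
        a = -kp * y - kd * v       # closed-loop acceleration
        y, v = y + T * v, v + T * a
        trajectory.append(y)
    return trajectory
```

With the critically damped gains, the position decays towards zero without changing sign, which is the smooth behavior the pole placement aims for.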

Figure 3 attests to the performance of the adaptive response system, in
terms of the evolution of the parameters $\alpha$, compared with the ground
truth ($\theta$, which changes with the slope). As can be seen, the parameter
vector $\alpha$ converges to the true parameters $\theta$ after some time.

Figure 3: Evolution of the parameters $\alpha$ (solid line) in comparison
with the true values (dashed line).

Figure 4 shows the evolution of the ball position in the camera without
an anticipating response system, i.e., the camera motion controller is fed by
$y^*$ instead of $y$. As expected, the delay introduced by the latency of the
perceptual channel jeopardizes the control of the camera. Also, the adaptive
response follows the drive with a delay of $\tau$.

Figure 5 compares the ball position in the camera with its anticipated
response. In this case, both are synchronized, since the ball coordinates in
the image converge to zero (except for a brief time after each slope change,
while the adaptive system learns the new parameters). Also, the anticipating
response makes it possible to control the drive system satisfactorily.

Figure 6 depicts the evolution of the prediction error estimate $V(e)$. Its
value approaches zero as the drive and response systems become synchronized.

Finally, Figure 7 shows the event segmentation results obtained using the
normalized metric (16), with a window of 10 s. As expected, each change of
plane is detected as an event boundary by the framework. Interestingly, the
peak of this metric, at the event boundary, increases with the window size,
without any loss of temporal resolution.

These results show that the proposed system is capable of correctly
(1) detecting the event boundaries that correspond to the changes of ramp
slope encountered by the ball, (2) controlling the camera movement using
anticipation, and (3) learning the correct system parameters.

Figure 4: System response without anticipation.

Figure 5: System response using the full architecture.

Figure 6: Prediction error $V$ for the first 10 seconds of the simulation.

Figure 7: Evolution of the ball $v_2$ coordinate along the experiment: the
top plot shows the detected events, and the bottom plot a zoom around the
first detected event. The delay observed in this second plot corresponds to
the perceptual delay $\tau$.

7 Conclusions and future work 

This report describes an event segmentation framework, targeting active
perception in robots, based on the concept of strong anticipation proposed by
Stepp et al. in [13]. A dynamical system synchronization paradigm is used as
the theoretical foundation of the proposed architecture, where the
robot-world coupled system is identified using a parametric method for
adaptation proposed by Chen et al. in [1], and the actuation is performed
using anticipation. This anticipation compensates for the net delay of the
perceptual channel. The capability of the architecture to anticipate
perception allows the robot to control its actuation based on the prediction
of the robot-world state, instead of relying on the delayed perceptual data.

Since the proof-of-concept experiments described above show that the
proposed architecture behaves as expected, future work includes scaling this
approach to more complex domains. This involves tackling the issue of the
learning rate, which is hidden in the proportionality constant of the
Lyapunov function used in Chen's learning rule, as well as the automatic
design of the controller, given the adapted parameters. Other open questions
include dealing with hidden state variables, as well as complex relations
among objects (e.g., grasping, occlusion, and so on).